Skip to content

JAVA-5949 preserve connection pool on backpressure errors when establishing connections#1900

Open
nhachicha wants to merge 33 commits intomongodb:backpressurefrom
nhachicha:nh/backpressure/preserve_connection_pool
Open

JAVA-5949 preserve connection pool on backpressure errors when establishing connections#1900
nhachicha wants to merge 33 commits intomongodb:backpressurefrom
nhachicha:nh/backpressure/preserve_connection_pool

Conversation

@nhachicha
Copy link
Copy Markdown
Collaborator

@nhachicha nhachicha commented Feb 25, 2026

Original PR #1854 accidentally closed, and had no outstanding review comments.

we should merge the latter into main, then merge this PR in backpressure, and leave the necessary TODO comments in #1918 related to the dependency on #1856.

  • then, when there are no outstanding PRs for the backpressure branch, rebase backpressure on top of main to include the changes from the main (see Merge main into backpressure #1917 (comment) for more context on this process).
  • then address the aforementioned TODO comments.

Specification changes:

Note that there was a change in the description of one of the tests that belongs to this PR, but I closed the corresponding JAVA-6095 (see this comment): DRIVERS-3394: adjust spec test description (#1894).

JAVA-5949, JAVA-6056, JAVA-5664

AI usage and effectiveness:

  • Model Opus 4.6 then 4.7
  • Used as an implementer + paired reviewer
  • Spec audits
  • Most architectural decisions were grounded in "what do Node and .NET do?" in addition to the reasoning from the spec text.

@nhachicha nhachicha self-assigned this Feb 25, 2026
@nhachicha nhachicha requested a review from vbabanin February 25, 2026 16:56
@nhachicha nhachicha marked this pull request as ready for review February 25, 2026 16:56
@nhachicha nhachicha requested a review from a team as a code owner February 25, 2026 16:56
@codeowners-service-app
Copy link
Copy Markdown

Assigned stIncMale for team dbx-java because vbabanin is out of office.

@stIncMale
Copy link
Copy Markdown
Member

Added the following to the description of the PR (a new piece of work to be done within the PR): DRIVERS-3394: adjust spec test description (#1894)(JAVA-6095).

// clear the pool as they're not related to overload.
// TLS configuration errors (certificate validation, protocol mismatches) should also clear the pool
// as they indicate configuration issues, not server overload.
if (beforeHandshake && !sdamIssue.relatedToAuth() && !sdamIssue.relatedToTlsConfigurationError()) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently we attach SystemOverloadedError and RetryableError labels in the SDAM error-handling path (effectively only for DefaultServer). In load-balanced mode, SDAM isn’t involved: the LB code path invalidates the pool directly (e.g., connectionPool.invalidate(serviceId, generation)), so the labeling logic is bypassed.

This means users running the driver in LB mode (behind an NLB) can still hit network errors, TLS handshake failures, timeouts during connection establishment or hello, but won’t get the labels.

However, these labels are a CMAP requirement, not SDAM. The CMAP spec states: “The pool MUST add the error labels SystemOverloadedError and RetryableError to network errors or network timeouts it encounters during the connection establishment or the hello message.”

Since this is defined as a pool behavior (topology-agnostic), it seems we should implement the labeling in the connection pool layer so it applies consistently in both default and load-balanced modes.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates SDAM/backpressure handling so that connection-establishment failures caused by server-side backpressure do not clear the connection pool, and adds/updates tests and matchers to align with the SDAM spec test expectations.

Changes:

  • Adjust SDAM exception handling to avoid invalidating the pool on certain pre-handshake network failures, and add TLS-configuration-error classification.
  • Add new error label constants (SystemOverloadedError, RetryableError) and apply them for backpressure-related failures.
  • Extend unified event matching for additional server types and add a prose test for connection pool backpressure behavior.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
driver-sync/src/test/functional/com/mongodb/client/unified/EventMatcher.java Extends server type matching for SDAM serverDescriptionChangedEvent expectations.
driver-sync/src/test/functional/com/mongodb/client/ServerDiscoveryAndMonitoringProseTests.java Adds a prose test intended to validate connection pool backpressure behavior via server rate limiter parameters.
driver-core/src/test/unit/com/mongodb/internal/connection/DefaultServerSpecification.groovy Updates unit tests to expect pool preservation on network errors during connection establishment (sync/async).
driver-core/src/main/com/mongodb/internal/connection/SdamServerDescriptionManager.java Adds TLS configuration error detection helper on SDAM issues.
driver-core/src/main/com/mongodb/internal/connection/DefaultSdamServerDescriptionManager.java Changes SDAM exception handling to preserve pool/server description for certain pre-handshake failures and apply labels.
driver-core/src/main/com/mongodb/MongoException.java Introduces new public error label constants used by the backpressure handling path.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread driver-core/src/main/com/mongodb/MongoException.java
Copy link
Copy Markdown
Member

@stIncMale stIncMale Mar 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[this comment is left on a file, but has nothing to do with the file; this is done so that we could reply to the comment; commenting on a PR does not allow replies - horrendous GitHub functionality]

The current PR seemingly depends on #1856 (see the description of the current PR). We need to decide what to do with that.

P.S. I originally left my thoughts in the description of this PR, but I don't know if that's what we are going to do.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We no longer depend on this PR see #1900 (comment)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[this comment is left on a file, but has nothing to do with the file; this is done so that we could reply to the comment; commenting on a PR does not allow replies - horrendous GitHub functionality]

@nhachicha

  • Could you please confirm that all the specification changes listed in the "Specification changes" part of the description of the current PR have been addressed in the PR?
  • If they have been addressed, let's update the description correspondingly.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Description updated. All specs are implemented

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[this comment is left on a file, but has nothing to do with the file; this is done so that we could reply to the comment; commenting on a PR does not allow replies - horrendous GitHub functionality]

JAVA-5949 has an old comment, which instructs to re-enable some tests that were previously disabled when we updated the testing/resources/specifications submodule in main. Those tests were not enabled. Let's enable them and make sure they pass.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR no longer depends on #1856. Spec PR mongodb/specifications#1880 rewrote the backpressure-network-*-fail.yml tests to wait on serverDescriptionChangedEvent instead of serverHeartbeatSucceededEvent.

@nhachicha nhachicha force-pushed the nh/backpressure/preserve_connection_pool branch from 0be0250 to bce97d4 Compare April 22, 2026 17:16
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +74 to +100
Throwable cause = t.getCause();
while (cause != null) {
if (cause instanceof CertificateException
|| cause instanceof CertPathBuilderException
|| cause instanceof CertPathValidatorException
|| cause instanceof SSLPeerUnverifiedException
|| cause instanceof SSLProtocolException) {
return true;
}
if (cause instanceof SSLHandshakeException) {
String message = cause.getMessage();
if (message != null) {
String lowerMessage = message.toLowerCase(Locale.ROOT);
if (lowerMessage.contains("certificate")
|| lowerMessage.contains("verify")
|| lowerMessage.contains("trust")
|| lowerMessage.contains("hostname")
|| lowerMessage.contains("protocol")
|| lowerMessage.contains("cipher")
|| lowerMessage.contains("handshake_failure")) {
return true;
}
}
}
cause = cause.getCause();
}
return false;
Copy link
Copy Markdown
Member

@vbabanin vbabanin Apr 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dug into isTlsConfigurationError and the TLS exception landscape across JDK versions and providers. Sharing what I found below and what we discussed over the call.


Current TLS config error gaps

The keyword list (certificate, verify, trust, hostname, protocol, cipher, handshake_failure) catches 12 of the 25 RFC-defined TLS alert descriptions but misses these config errors.

Alert JDK exception message Labeled?
unknown_ca(48) "Received fatal alert: unknown_ca" Yes -- false postive
access_denied(49) "Received fatal alert: access_denied" Yes -- *false positive *
insufficient_security(71) "Received fatal alert: insufficient_security" Yes -- false positive
unrecognized_name(112) "Received fatal alert: unrecognized_name" Yes -- false positive

BouncyCastle's JSSE provider is currently unhandled: it produces TlsFatalAlertReceived (extends IOException, not SSLException), however, there is no good way to catch those.

How FIN, RST, and TLS alerts surface across providers and transports

Full exception chains (42 test cases)

FIN (graceful close during TLS handshake):

Transport Provider Exception chain
SocketStream JDK MongoSocketWriteException -> SSLHandshakeException("Remote host terminated the handshake") -> EOFException("SSL peer shut down incorrectly")
NettyStream JDK MongoSocketWriteException -> ClosedChannelException
TlsChannelStream JDK MongoSocketWriteException -> ClosedChannelException
SocketStream BC MongoSocketWriteException -> TlsFatalAlert("handshake_failure(40)")
NettyStream BC MongoSocketWriteException -> ClosedChannelException
TlsChannelStream BC MongoSocketWriteException -> ClosedChannelException

RST (abrupt reset during TLS handshake):

Transport Provider Exception chain
SocketStream JDK MongoSocketWriteException -> SocketException("Connection reset")
NettyStream JDK MongoSocketWriteException -> SslClosedEngineException("SSLEngine closed already")
TlsChannelStream JDK MongoSocketWriteException -> SocketException("Connection reset")
SocketStream BC MongoSocketWriteException -> SocketException("Connection reset")
NettyStream BC MongoSocketWriteException -> SslClosedEngineException("SSLEngine closed already")
TlsChannelStream BC MongoSocketWriteException -> SocketException("Connection reset")

TLS alert example -- unknown_ca(48):

Transport Provider Exception chain
SocketStream JDK MongoSocketWriteException -> SSLHandshakeException("Received fatal alert: unknown_ca")
NettyStream JDK MongoSocketWriteException -> SSLHandshakeException("Received fatal alert: unknown_ca")
TlsChannelStream JDK MongoSocketWriteException -> SSLHandshakeException("Received fatal alert: unknown_ca")
SocketStream BC MongoSocketWriteException -> TlsFatalAlertReceived("unknown_ca(48)")
NettyStream BC MongoSocketWriteException -> SSLException -> TlsFatalAlertReceived("unknown_ca(48)")
TlsChannelStream BC MongoSocketWriteException -> SSLException -> TlsFatalAlertReceived("unknown_ca(48)")

Key observation: FIN/RST get labeled correctly today because they don't match any exclusion check . A TLS alert from the server means it actively responded, so it wasn't shedding load.
Assumption: A rate limiter drops the TCP connection (FIN/RST) before or during a TLS exchange.

Two approaches

Both are heuristic-based -- perfect classification isn't possible since exception types and messages are provider implementation details.

A. Improved blocklist (current approach): Label by default, exclude what we can identify as not-overload.
Failure mode: unrecognized config errors stay labeled (pool preserved, unnecessary retries).

it should be better for unknown providers - if we can't classify an error, pool is preserved, which is the safer default under load.

B. Allowlist: Only label recognized I/O patterns (EOFException, SocketException("Connection reset"), ClosedChannelException).

Failure mode: unrecognized I/O errors are not labeled (pool cleared unnecessarily). This is the pre-backpressure behavior. For example, BC represents FIN as TlsFatalAlert("handshake_failure(40)") - the allowlist misses it, pool gets cleared.

TL;DR / Suggestion:

As we discussed over the call, the blocklist (Approach A) matches the CMAP spec's direction ("the pool MUST add the error labels" as default, with exclusions carved out).

However, we could check for all RFC alert description strings, so we can improve the heuristic.

All 25 handshake-only TLS alert descriptions from OpenJDK's Alert.java (RFC 8446/5246)

These are the alert codes with handshakeOnly=true in OpenJDK. When received from a peer, they produce SSLHandshakeException("Received fatal alert: <description>"). All are definitively config/protocol errors, not I/O:

Code Description string Currently caught by keywords?
40 handshake_failure Yes (handshake_failure)
41 no_certificate Yes (certificate)
42 bad_certificate Yes (certificate)
43 unsupported_certificate Yes (certificate)
44 certificate_revoked Yes (certificate)
45 certificate_expired Yes (certificate)
46 certificate_unknown Yes (certificate)
47 illegal_parameter No
48 unknown_ca No
49 access_denied No
50 decode_error No
51 decrypt_error No
60 export_restriction No
70 protocol_version Yes (protocol)
71 insufficient_security No
100 no_renegotiation No
109 missing_extension No
110 unsupported_extension No
111 certificate_unobtainable Yes (certificate)
112 unrecognized_name No
113 bad_certificate_status_response Yes (certificate)
114 bad_certificate_hash_value Yes (certificate)
115 unknown_psk_identity No
116 certificate_required Yes (certificate)
120 no_application_protocol No

We should keep the existing type checks (CertificateException, CertPathBuilderException, CertPathValidatorException, SSLPeerUnverifiedException, SSLProtocolException) - they handle local validation failures correctly on both providers (and should on others).

P.S Claude was used to examine JSSE providers and create a table of RFC messages used in OpenJDK's provider.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactored for option A

Comment on lines +265 to +266
BackpressureErrorLabeler.applyLabelsIfEligible(exceptionToThrow)
assert !exceptionToThrow.hasErrorLabel(MongoException.SYSTEM_OVERLOADED_ERROR_LABEL)
Copy link
Copy Markdown
Member

@vbabanin vbabanin Apr 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part seems better suited for the unit tests where the SUT is BackpressureLabeler (e.g., BackpressureErrorLabelerTest). In this test the SUT is server.getConnection(), so it would be preferable to avoid asserting labeling internals here.

Could we instead assert the “label not applied” outcome at the behavior level in the then: section (e.g., on def e = thrown(MongoException)) and keep the unit tests in BackpressureLabeler responsible for the labeling implementation details?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

nhachicha and others added 3 commits April 30, 2026 22:39
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

nhachicha and others added 3 commits May 1, 2026 07:46
Co-authored-by: Copilot Autofix powered by AI <[email protected]>
…ureErrorLabeler.java

Co-authored-by: Viacheslav Babanin <[email protected]>
- Add the complementary test to cover the suppress-on-label branch.
@nhachicha nhachicha requested a review from vbabanin May 1, 2026 08:51
Comment on lines +89 to +90
|| lowerMessage.contains("trust")
|| lowerMessage.contains("hostname")
Copy link
Copy Markdown
Member

@vbabanin vbabanin May 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What scenario produces an SSLHandshakeException with "trust", "hostname", or "pkix" in the message which is not covered by ? I couldn’t find these strings in the OpenJDK JSSE provider sources - are they coming from a different provider or JDK version?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I saw some SO messages with these: ex1 & ex2 and ex3 but you can argue these are coming from the cause getMessage, which makes these checks empirical. I'm leaning towards keeping only the received fatal alert

@nhachicha nhachicha requested a review from vbabanin May 5, 2026 15:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants